OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus

Internal identifier: 000D58 (Main/Exploration); previous: 000D57; next: 000D59

Authors: Artur Šilić [Croatia]; Jean-Hugues Chauchat [France]; Bojana Dalbelo Bašić [Croatia]; Annie Morin [France]

Source: Lecture Notes in Computer Science [0302-9743]; 2007

RBID: ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB

Abstract

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.

Url: https://api.istex.fr/document/E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB/fulltext/pdf
DOI: 10.1007/978-3-540-77002-2_56
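
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy documents, and the choice of n = 3 are assumptions made for illustration, and `tracemalloc` only tracks memory allocated by Python itself. The sketch extracts word features and character n-gram features from the same texts and reports the feature count, elapsed time, and peak memory for each.

```python
import time
import tracemalloc
from collections import Counter


def word_features(text):
    """Count lowercased whitespace-separated tokens (no morphological normalization)."""
    return Counter(text.lower().split())


def char_ngram_features(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def measure(extract, docs):
    """Return (distinct feature count, seconds elapsed, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    vocab = Counter()
    for doc in docs:
        vocab.update(extract(doc))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(vocab), elapsed, peak


# Toy documents invented for illustration; they are not from the paper's corpus.
docs = ["the law is published", "laws are published in the gazette"]

for name, fn in [("words", word_features), ("3-grams", char_ngram_features)]:
    size, secs, peak = measure(fn, docs)
    print(f"{name}: {size} features, {secs:.6f} s, peak {peak} bytes")
```

In this toy input the inflected forms "law" and "laws" are distinct word features but share the 3-gram "law", which hints at how n-grams can stand in for morphological normalization, at the cost of a larger feature space.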



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct:series">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author>
<name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</author>
<author>
<name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
</author>
<author>
<name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
</author>
<author>
<name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-77002-2_56</idno>
<idno type="url">https://api.istex.fr/document/E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001D02</idno>
<idno type="wicri:Area/Istex/Curation">001B82</idno>
<idno type="wicri:Area/Istex/Checkpoint">000773</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Sili A:n:grams:and</idno>
<idno type="wicri:Area/Main/Merge">000D71</idno>
<idno type="wicri:Area/Main/Curation">000D58</idno>
<idno type="wicri:Area/Main/Exploration">000D58</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author>
<name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author>
<name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Lyon 2, Faculté de Sciences Economique et de Gestion, Laboratoire Eric, 5 avenue Pierre Mendès France, 69676 Bron Cedex</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Bron</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Croatie</country>
<wicri:regionArea>University of Zagreb, Department of Electronics, Microelectronics, Computer and Intelligent Systems, KTLab, Unska 3, 10000 Zagreb</wicri:regionArea>
<wicri:noRegion>10000 Zagreb</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Croatie</country>
</affiliation>
</author>
<author>
<name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
<affiliation wicri:level="4">
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rennes 1, IRISA, 35042 Rennes Cedex</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Région Bretagne</region>
<settlement type="city">Rennes</settlement>
</placeName>
<orgName type="university">Université de Rennes 1</orgName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<idno type="DOI">10.1007/978-3-540-77002-2_56</idno>
<idno type="ChapterID">56</idno>
<idno type="ChapterID">Chap56</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Croatie</li>
<li>France</li>
</country>
<region>
<li>Auvergne-Rhône-Alpes</li>
<li>Rhône-Alpes</li>
<li>Région Bretagne</li>
</region>
<settlement>
<li>Bron</li>
<li>Rennes</li>
</settlement>
<orgName>
<li>Université de Rennes 1</li>
</orgName>
</list>
<tree>
<country name="Croatie">
<noRegion>
<name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</noRegion>
<name sortKey="Dalbelo Basi, Bojana" sort="Dalbelo Basi, Bojana" uniqKey="Dalbelo Basi B" first="Bojana" last="Dalbelo Baši">Bojana Dalbelo Baši</name>
<name sortKey="Sili, Artur" sort="Sili, Artur" uniqKey="Sili A" first="Artur" last="Šili">Artur Šili</name>
</country>
<country name="France">
<region name="Auvergne-Rhône-Alpes">
<name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
</region>
<name sortKey="Chauchat, Jean Hugues" sort="Chauchat, Jean Hugues" uniqKey="Chauchat J" first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name>
<name sortKey="Morin, Annie" sort="Morin, Annie" uniqKey="Morin A" first="Annie" last="Morin">Annie Morin</name>
</country>
</tree>
</affiliations>
</record>
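
A record of this shape can also be read with a general-purpose XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, not a Dilib tool. The embedded `RECORD` is an abridged excerpt of the record above, the `summarize` helper is hypothetical, and the `xmlns:wicri` URI is a placeholder assumption added only because the dump uses the `wicri:` prefix without declaring it, which a namespace-aware parser would otherwise reject.

```python
import xml.etree.ElementTree as ET

# Abridged excerpt of the record above. The xmlns:wicri value is a
# placeholder (assumption); the original dump does not declare the prefix.
RECORD = """<record xmlns:wicri="urn:example:wicri">
<TEI wicri:istexFullTextTei="biblStruct:series">
<teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus</title>
<author><name first="Jean-Hugues" last="Chauchat">Jean-Hugues Chauchat</name></author>
<author><name first="Annie" last="Morin">Annie Morin</name></author>
</titleStmt>
<publicationStmt>
<idno type="RBID">ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-77002-2_56</idno>
</publicationStmt>
</fileDesc></teiHeader>
</TEI>
</record>"""


def summarize(xml_text):
    """Extract title, author names, DOI, and year from a record string."""
    root = ET.fromstring(xml_text)
    stmt = root.find(".//titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [name.text for name in stmt.findall("author/name")],
        "doi": root.findtext(".//idno[@type='doi']"),
        "year": root.findtext(".//publicationStmt/date"),
    }


print(summarize(RECORD))
```

The elements themselves carry no namespace, so plain tag names work in the path expressions; only the prefixed attributes need the declaration.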

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D58 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D58 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:E6E28C59F7DED7FCC61B1B3D5A47E29371652FEB
   |texte=   N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024